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Abstract 

Statistical post-processing of dynamical forecast ensembles is an essential compo- 
nent of weather forecasting. In this article, we present a post-processing method that 
generates full predictive probability distributions for precipitation accumulations based 
on ensemble model output statistics (EMOS). 

We model precipitation amounts by a generalized extreme value distribution that is 
left-censored at zero. This distribution permits modelling precipitation on the original 
scale without prior transformation of the data. A closed form expression for its contin- 
uous rank probability score can be derived and permits computationally efficient model 
fitting. We discuss an extension of our approach that incorporates further statistics 
characterizing the spatial variability of precipitation amounts in the vicinity of the 
location of interest. 

The proposed EMOS method is applied to daily 18-h forecasts of 6-h accumulated 
precipitation over Germany in 2011 using the COSMO-DE ensemble prediction sys- 
tem operated by the German Meteorological Service. It yields calibrated and sharp 
predictive distributions and compares favourably with extended logistic regression and 
Bayesian model averaging which are state of the art approaches for precipitation post- 
processing. The incorporation of neighbourhood information further improves predic- 
tive performance and turns out to be a useful strategy to account for displacement 
errors of the dynamical forecasts in a probabilistic forecasting framework. 



1 Introduction 



In recent years, weather prediction has seen a culture change towards probabilistic fore- 
casting. In order to represent forecast uncertainties, ensembles of dynamical forecasts are 
generated with members corresponding to model integrations that differ in t he initial condi- 
tions and/or the numerical representation of the atmosphere (jPalmerl . |2002| ). These are the 
main sources of uncertainty, but the different ensemble members still share certain struc- 
tural model deficiencies and usually fail to represent the full uncertainty that comes with 
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numerical weather prediction. Statistical post-processing has therefore become an integral 
part of any ensemble prediction system, aiming to rem ove systematic biases and to achieve 
appropriate representation of the forecast uncertainty (iGneiting and Rafteryl . 120051 ) . 

Among the various approaches to statistical post-processing, methods that transform the 
ensemble forecasts into a full predictive cumulative distribution function (CDF) function are 
particularly convenient, because all kinds of probabilistic statements (prediction intervals, 
probabilities of threshold exceedance, etc.) can be derived from this CDF in a consistent 
way. Finding a suitable probabilistic model is more challenging for precipitation than for 
most other weather variables because the associated uncertainty calls for rather special dis- 
tributions with the following features: 



• they must be non-negative 

• they may be equal to zero with positive probability 

• their non-zero component has positive skew 



Several authors ( Hamill and Colucci . 1997 : Sloughter et al. . 2007 : Wilks, 20091 ) report an 
improved fit if their models are fitted to powers (typically between 0.25 and 0.5) of forecasts 
and observations, which suggests that the predictive distribution for precipitation amounts 
on the original scale has a heavy right tail. An inconvenient side effect of such power trans- 
formations is, however, that they distort the original scale, accentuating ensemble spread at 
low precipitation levels and attenuating it at higher ones. To avoid this effect, we will use a 
family of distributions that can be used directly with the untransformed data. 



We adopt the paradigm formulated by lBrocker and Smith! (120081 ) . that post-processing should 
make optimal use of the information contained in the ensemble without relying on any as- 
sumption about ensemble forecasts being draws from some unknown distributions. Our aim 
is to develop a model that is of comparable conceptio nal simplicity and computational effi- 
ciency as the extended logistic regression approach by IWilkd ( l2009l ). but in addition permits 
the inclusion of uncertainty information from the ensemble in a natural and in tuitive way. To 



achiev e this, we adapt the non-homogeneous Gaussian regression approach by IGneiting et al. 



( 120051 ) so as to respect the pecuharities of precipitation. In Sec. |2]we introduce an adaptation 
of the generalized extreme value (GEV) distribution family, and argue that it presents, for 
reasonable choices of the shape parameter, an ideal candidate for predictive distributions 
for quantitative precipitation. We further discuss suitable predictor variables for the GEV 
location and scale parameter that convey the relevant information of the ensemble. A closed 
form expression for the continuous rank probability score (CRPS) of the adapted GEV is 
provided and will be used for model fitting. In Sec. |31 the issue of displacement errors of 
dynamical precipitation forecasts is addressed. These errors are another peculiarity of pre- 
cipitation and a further source of uncertainty. This uncertainty can be taken into account 
by our EMOS method by considering, for each location of interest, the dynamical forecasts 
in a larger neighbourhood of this location, and condensing this neighbourhood information 
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into a further predictor. In Section H] we apply our method to daily 18-h forecasts of 6-h 
accumulated precipitation over Germany in 2011 generated by the COSMO-DE ensemble 
prediction system from the German Meteorological Service. A discussion of possible further 
extensions to our approach is subject of Sec. [51 

2 A distribution family for quantitative precipitation 

Consider the CDF of the generalized extreme value (GEV) distribution 



which is a family of continuous distributions with parameters /i, a and ^ that characterize 
location, scale and shape of the GEV. For ^<0,?/>/i — | one defines Giy) := 1, for 
^ > 0,y < fi — J one defines G{y) := 0. In this paper we will always assume ^ e (—0.278, 1), 
because for this range of values, the GEV has positive skew and its mean exists and is equal 
to 



where F denotes the gamma function and 7 ~ 0.5772 is the Euler-Mascheroni constant. To 
obtain a suitable model for precipitation amounts, we consider the GEV to be left-censored 
at zero, i.e. all mass below zero is assigned to exactly zero. The predictive CDF then becomes 



This distribution is non-negative and exactly zero with positive probability if either ^ < 
or ^ > and /i < |. For ^ > it has a heavy right tail, and so it accommodates all of 
the important features of a predictive distribution for quantitative precipitation forecasting 
mentioned above, and can be used directly with the untransformed data. To illustrate how 
the left-censored GEV distributions look in practice. Fig. [1] shows the predictive density of 
6-h precipitation accumulations on 11 June 2011 at various locations in Germany. Details 
about the data and the forecast ensemble are given later in Sec. HI 

Linking the parameters to ensemble model output statistics 

The next step after specifying a suitable family of predictive distributions is to link its 
parameters to suitable predictor variables. Although m is no longer the mean of the left- 
censored GEV, it is a more suitable location parameter for our purposes than n, since it 
interacts more naturally with a. Indeed, if for fixed /i the scale parameter a is increased. 






for y > 
for y < 



3 




Figure 1: Predictive distributions of precipitation accumulations on 11 June 2011 between 1200 
UTC and 1800 UTC at Stuttgart airport, Munich airport, Cologne/Bonn airport, and Munich 
city (from left to right). The short blue lines below the densities represent the ensemble member 
forecasts. 



the whole mass of the (left-censored) GEV distribution is shifted to the right. If m is kept 
fixed instead, then an increase of a spreads the mass of the distribution more symmetrically 
to both sides, which corresponds much better to the idea of increased uncertainty. Now let 
/si, • • • , fsK be the ensemble member forecasts of precipitation amounts at location s. Their 
information will be condensed into the following statistics: 

• Is := ^ Y.k=i fsk (ensemble mean) 

• l{fs=o} '■= Ylik=i ^{fsk=Q} (fraction of zero precipitation members) 

• MD{is) := Yl,k,k'=i \fsk - fsk'l (ensemble mean difference) 



While is is a standard predictor for the location parameter and l{fs=o} (or v ariants of it) can 



provide additional information about whether precipitation occurs or not ( ISloughter et al. 



20071 : iBentzien and Friederichsl . |2012| ). MD{{s) has not been employed in this context so 



far. It has certain properties, however, that m ake it an appealing dispersion measure for 
precipitation forecasts (see also lYitzhakil - liooih : 



• it is more robust than the standard deviation because it uses absolute rather than 
squared differences 

• unlike the interquartile range or the median absolute deviation it is sensitive to all 
ensemble forecasts 



We let the parameters and as depend on the ensemble forecasts via 



rris = ao + ai ■ fs + ■ l{f,=o}, c^s = /3o + A ■ MD{fs)- 
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Model fitting 



Given the ensemble member forecasts for some location, a predictive CDF for this location 
can be issued once suitable parameters aQ,ai,a2, /3o, (3i and ^ have been selected. This is 
done based on a training set of forecasts-observations pairs. For each day, we let this training 
set consist of data from the preceding n days and all rain-gauge locations within our domain 
of interest. 



Proper scoring rules (iGneiting and Rafteryl . 120071 ) are a standard tool for quantitative as- 
sessment of the quality of probabilistic forecasts. A scoring rule s{F,y) assigns a numerical 
score to each pair {F, y), where F is the predictive CDF and y is the verifying observation. If 
negatively oriented scores are considered, a lower score indicates a better probabilistic fore- 
cast, where "better" refers to both calib ration and sharpness. These two properties should 



be the goal of probabilistic forecasting (IGneiting et al.l . 120071 ). and so it is natural to use 



strictly proper scoring rules as loss functions for training algorithms. With T denoting the 
set of training days, S the set of training sites, and its cardinality, the parameters are 
chosen such that the empirical score 
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teT s€S 



is minimized. When it comes to choosing a specific scoring rule that is adequate in the 
present framework, the following aspects should be taken into account: 

• even state of the art high-resolution limited-area mesoscale NWP models such as the 
COSMO-DE-EPS frequently issue forecasts that are - at least when considered grid- 
point by gridpoint - quite off the mark. The chosen scoring rule should therefore be 
reasonably robust and sh ould not be corrupt e d by a few "bad forecasts" (see also the 



discussion in Section 5 of iBrocker and Smithl . 120081 ) ; 



• the chosen scoring rule should be able to deal with the fact that the predictive distri- 
bution for precipitation has both a discrete and a continuous component. 

Both of these aspects make a strong case for the continuous ranked probability score 

/oo 
(F(t)-l[,,oo)(t))'cit (2) 
-oo 



(IHersbachl . l2000l : IGneiting and Rafteryl . 120071 ). A training algorithm based on minimum 
CRPS estimation typically requires many evaluations of ([T]) which can be computationally 
intractable if the integral in ({2]) has to be calculated numerically. However, in a recent paper 
Friederichs and Thorarinsdottirl ( l2012l ) derived a closed form expression for CRPS((j, y), and 
it is straightforward to generalize their calculations to the case of a left-censored GEV. For 
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^ 7^ one obtains 



cr 



CRFS{G^^o,y) = {fi - y){l - 2py) + i^p', - 2 - 1 - py -Ti{l ~ ^, -\ogpy) (3) 



l-p2-2«r,(l-e,-21ogpo) 



where 

Po:=G{0), Py:=G{y), 

and Ti denotes the lower incomplete gamma function. A closed form expression for CRPS(Gg=0) 
can be derived as well, but for numerical reasons it is preferable to use, for ^ G (— e, e) with 
e reasonably small, the approximation 



CRPS{G^=o,y) 



(£zi) 

2e 



CRPS(G5=_„1/) + ^ CRPS(G^=„i/). 



where the two scores on the right hand side are calculated according to ([3]). Since is the 
product of a gamma function and the CDF of the gamma distribution, all of the above terms 
can be calculated with standard statistical software. Owing to this closed form, minimum 
CRPS estimation is computationally efficient and feasible even for large training sets. 



Choice of the training period and regularization 

In our experiments with the COSMO-DE-EPS forecasts in Sec. H] we use a rolling training 
period of n = 30 days, i.e. at each verification day we fit our model to forecast-observation 
pairs from the preceding 30 days (as we consider lead times of 18h and less, this data 
would be available at the time where the new forecast is issued). One therefore obtains 
a different post-processing model for each verification day, which allows one to adapt to 
seasonal changes. The choice of the training period implies a bias/variance tradeoff. Longer 
periods imply more training data and lead to more stable parameter estimates, while shorter 
periods permit a more flexible reaction to seasonal changes. With our choice of n = 30 days 
we ensure there are a sufficient number of wet days, but nevertheless some of the parameter 
estimates show unrealistically strong fluctuations over time (see Fig. |2]). This suggests that 
a longer training period could be favourable, but instead of tuning n to our speciflc data set, 
we investigate a different strategy to stabilize the estimates. 

Generally, we expect the post-processing parameters to vary over time, but in such a way 
that changes are incremental. Because of this, it certainly makes sense to use the parameters 
chosen for one day as starting values for the training algorithm on the next day. Given 
the new training data, gradient-based optimization routines adjust these initial parameters 
typically by flrst taking a step in the direction of the steepest descent of the objective function 
and then proceeding with further reflnements until the reduction in the objective is within a 
specifled tolerance. To avoid overfltting, we use a regularization strategy which is referred to 
as "early stopping" in the machine learning and statistics literature. Instead of iterating the 
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Figure 2: Temporal evolution of the location parameters (top left), ai (top), and 02 (top 
right), the scale parameters (3q (bottom left) and /3i (bottom), and the shape parameter ^ (bottom 
right) obtained with standard CRPS minimization (solid lines) and early stopping (dashed lines). 



optimization routine to convergence, one stops after a few iterations. The rationale behind 
this strategy is that the important adjustments to the parameters are made during the first 
steps, while further adjustments often improve the fit to unimportant or even random features 
in the data only. Following this idea, and assuming that the parameters of the preceding 
day are excellent starting values, we stop numerical optimization after just one iteration of 
the quasi-Newton Broyden-Fletche r-Goldfarb-Shanno (BFGS) algorithm implemented in R 
f R Development Core Team . 2010l ). Note that this early stopping implies that the CRPS 



minimization based on the training data is incomplete. It turns out, however, that the 
verification results obtained with the regularized parameter estimates are even better than 
those obtained without regularization. More importantly, the temporal evolution of the 
different parameters becomes much smoother and hence more realistic (see Fig. |2]). On the 
first verification day, where we need to guess a set of initial parameters, we let the BFGS 
algorithm perform ten iterations to permit it get away from the starting values. We take these 
to be (tto, «!, 02) = (0, 1, —1), optimistically suggesting that closely follows the ensemble 
mean with no need for and intercept and a assuming a moderate negative contribution of 
l{f^=o}- Likewise we take (/3o, /9i) = (0.1, 1), this time allowing for a small intercept to ensure 
the variance of G is positive. We have no intuition about the shape parameter and start with 
,^ = 0. In order to make sure that /3o > 0, /3i > and G (—0.278, 1) on all verification days, 
we alter the objective function ([T]) so as to return a high enough value (we take twice the 
value of the last "admissible" evaluation of the objective) whenever some parameter violates 
the above conditions. In our experimental runs, however, it never occurred that the BFGS 
algorithm attempted to evaluate the objective function with a parameter set outside the 
admissible range. 

Fig. [2] shows the temporal evolution of all parameters over our 365-day verification period. 
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Apparently, some of the parameters are much more prone to overfitting than others. The 
parameter ai, for example, which relates the ensemble mean to the location parameter m, 
is certainly the most influential of all parameters, and is fitted in a more or less stable way 
even if the training algorithm is iterated to convergence. The fraction of zero precipitation 
members, on the contrary, while providing additional information for distinguishing wet 
and dry regimes, has far less impact on the CRPS. As a consequence, the estimates of 
a2 are much more volatile, but can be stabilized by the proposed strategy. All remaining 
parameters are somewhere in between, and early stopping smoothes their temporal evolution 
while maintaining their seasonal cycles. Overall we conclude that early stopping yields 
- in addition to the computational speed-up - the benefit of avoiding overfitting of the 
post-processing parameters, and can thus e ven improve p redictive performance. A similar 
conclusion was noted in a different context bv lHaniilll (120071). who criti cized the EM algorithm 
used in the Bayesian model averaging technique (iRaftery et al.l . |2005| ) for assigning radically 
unequal weights to different ensemble members, and explained this as well by overfitting. 



3 Addressing displacement errors by using neighbour- 
hood information 

The method presented in Sec. [2] can be used right away to turn ensemble forecasts into a 
predictive distribution, but there is another peculiarity about precipitation that a good post- 
processing method should take into account: the issue of displacement errors. Precipitation 
results from complex dynamical and microphysical processes, and precipitation forecasts will 
strongly be affected by uncertainties in the forecasts of many other variables in the NWP 
model. Underestimating the velocity of an approaching front, for example, will entail misal- 
location of forecast precipitation amounts at locations around the true and the forecast front 
position at the considered time. Even more challenging is the situation where precipitation 
results from locally forced convection, and predictability is limited by the short time scales 
of these processes. In both cases, the quantitative precipitation forecasts at specific locations 
by the NWP model may be quite far off, even if the dynamics are captured well qualitatively 
- the "correct" precipitation amount may simply be displaced by a certain distance. To 
address such displacement errors we consider ensemble forecasts in a neighbourhood A/'(s) 
of radius r around the location s of interest. More specifically, we pass from /^i, . . . , fsx to 
weighted averages of the forecasts at all gridpoints x within Af{s): 

fms),k ■■= '^x^f^k, k = l,...,K. 

Rather than using equal weights wi^^ within the neighbourhood, we let them decay smoothly 
with distance from s 

«;W~max|l-f^!^V,0 
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and rescale them such that X]zeA/'(s) '"^^''^ ~ '^^^ rationale behind this choice is as follows: 
on the one hand we think that due to displacement, the ensemble forecasts at some point x 
near s might be closer to the truth than those at s itself. On the other hand, we still think that 
the further we move away from s, the less likely it is, in absence of further information about 
a possible displacement, that the respective forecasts are more appropriate. In principle, this 
idea could be generalized and elliptic neighbourhoods could be used in order to account for 
anisotropy in location uncertainty. Or, if systematic displacement errors are observed, this 
could be corrected by shifting the centre of J\f{s). Both extensions are likely to yield further 
improvement, but linking shapes and shifts of the neighbourhood to suitable covariates is 
all but straightforward, and for simplicity we shall stay for now with radially symmetric, 
unshifted neighbourhoods. The original predictors from Sec. |2] become 

• fjv{s) ■= J2k=i fM{s),k (mean weighted neighbourhood average) 



(weighted fraction of zero precipitation members) 

• M£'(fv(s)) := -^Ylk,k'=i \ fN{s),k - fN{s),k'\ 
(mean diff. of weighted neighbourhood averages) 

Their re-definition so far essentially replaces the original forecasts by smoothed forecasts. 
For the third predictor this can also be a drawback: it still carries information about fore- 
cast uncertainty due to uncertain initial conditions and model physics, but displacement 
uncertainty represented by the ensemble may even by partly smoothed out. We therefore 
add a further predictor 

MD^ism ■■= j^Y. E ^'^^^x'\f^k-fx'k\ 

k=l x,x'&M(s) 

(mean neighbourhood weighted mean differences) which addresses uncertainty due to spatial 
variability of precipitation forecasts, but averages over the different ensemble members. The 
calculation of the double sum can be compu tationally expensive for large neighbourhoods, 
but by comparing equations (19) and (20) in lHersbachI fl2000h one obtains 



N-l 



E ^^''^x'\u-fx'k\ = 2Y,y^^'\^-y^'^'^){u.k-u) (4) 

x,x'&J\f{s) i=l 

where xi,...,xn is, for any fixed k, an enumeration of all gridpoints in A/'(s) such that 
fxik < • • • < fxpfk and W*-*-' = J2]=i'^^^j'- The computational co sts for sorting are negligible 



if efficient sorting algorithms such as quicksort or heapsort (see iPress et al.l . Il989l ) are used 



and so passing from a double to a single sum can reduce computational time considerably. 
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Figure 3: Standardization factor Cs (left), ensemble mean of predicted precipitation for 12 Decem- 
ber 2011 between 1200 UTC and 1800 UTC (middle left), and mean weighted 60 km-neighbourhood 
averages calculated on the original (middle right) and on the standardized (right) scale. 

The location and scale parameters m and a of our left-censored GEV are linked to the above 
statistics via 

m = ao + ai ■ fAr(s) + "2 ■ lAr(s),{f,=o} 
a = /3o + /3i ■ Mi}(f^(,)) + /32 ■ MD^^,^{Q 

As before, the parameters ao,ai,a2, (3q, f3i, (32 and ^ are fitted to training data via CRPS 
minimisation. 

The choice of the neighbourhood radius r certainly depends on both the forecast ensemble 
and the topography of the considered domain. Large neighbourhoods can make allowance for 
large displacement errors, but entail a blurring of precipitation processes thus reducing the 
spatial resolution of forecasts. One particular aspect of this is that orographically induced 
precipitation will be spread out to locations which are close-by but have an entirely different 
topography. At least this aspect can be accounted for by "standardizing" predicted precipi- 
tation amounts to make them independent of the orography, calculating the above predictor 
variables with the standardized forecasts, and finally transforming the resulting predictors 
back to the original, orography-dependent scale. As an indicator for orography- related spa- 
tial variations in precipitation patterns we define Cg as the mean annual precipitation amount 
at gridpoint s divided by the median of all of these values over the considered domain. We 
then replace f^k in the definitions of fv(s) and MD_\r(s)(fx) by the standardized forecasts 
fxk '■= fxk/cx and use 

Cs -i^Mis), c, ■MZ)(f^(,)), and Cs-MD^^^ 

as predictors for the parameters m and a (the remaining statistic, lAr{s),{f^=o}, is let un- 
changed). To calculate the standardization factor Cg we used measurements of mean annual 
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precipitation at rain gauges in Germany, Austria and Switzerland from the years 2005 to 
2010. We fitted a 3D spatial Gaussian process model to these observati ons and interpolated 
them to the forecast grid (2.8 km horizontal resolution) via Kriging (cf. I Chiles and Delfineii . 
[l999i), a technique for spatial interpolation similar to what is referred to as "objective analy- 
sis" in the meteorological literature. The fitted 3D model describes how correlations between 
observations decay both horizontally and vertically. It thus provides a measure of similarity 
between different sites which is the starting point of the Kriging interpolation scheme. As the 
topography of Germany gets increasingly complex as one moves from north to south (starting 
with coastal areas, followed by the North German Plain, then by an area with various low 
mountain ranges, and finally the alpine foothills), additional data from the southern neigh- 
bours, Austria and Switzerland, was required to ensure one obtains reasonable interpolates 
in the alpine regions. 

Figure |3] illustrates the difference between the climatology-corrected and the uncorrected 
neighbourhood averaging scheme with precipitation forecasts over Germany. The standard- 
ization factor Cs is shown in first plot. The second plot depicts the ensemble mean of 
predicted precipitation on a day where elevated precipitation levels are forecast in particular 
over several alpine regions. Black Forest (south-west) and, to a lesser extent, also some other 
mid-range mountains. If mean weighted neighbourhood averages with a neighbourhood ra- 
dius of 60 km are considered instead of the ensemble mean, orographic effects are largely 
smoothed out (third plot). The corresponding statistics with the same neighbourhood size 
but climatology-correction account for displacement errors while keeping high precipitation 
levels linked to mountainous topography (fourth plot). 



4 Post-processing of COSMO-DE-EPS forecasts of pre- 
cipitation accumulations over Germany in 2011 

Forecast and observational datasets used 

We test the EMOS post-processing method introduced in Sections [2] and [3] with forecasts 
of 6-h accumulated precipitation from the COSMO-DE-EPS and rain gauge observations 
from about 1130 SYNOP stations in Germany. COSMO-DE-EPS is a multi-analysis and 
multi-physics ensemble prediction syst em based on the hig h-resolution numerical weather 



prediction (NWP) model COSMO-DE flBaldauf et all . 1201 ih . a configuration of the COSMO 



model with a horizontal grid size of 2.8 km operated by the German Meteorological Service 
(DWD). It is in pre-operational phase since 9 December 2010, covers the area of Germany 
and produces forecasts with lead times up to 21 hours. The current setup of the lateral 
boundary conditions uses forecasts from different global models, while different configura- 
tions of the COSMO-DE model are used for the variation of model physics. Deep convection 
is simulated explicitly, and the perturbations of the model physics are tail ored to represent 



uncertainties in quantitative precipitation forecasts (for further details see iGebhardt et al 
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20111). Preliminary studies show that ensemble precipitation forecasts by COSMO-DE-EPS 



are just slightly under dispersive and have a moderate tendency to overforecast precipitation 
amounts. Despite their good performance we think that probabilistic quantitative precipi- 
tation forecasts can be further improved by post-processing, especially during the summer 
months where precipitation is often due to locally forced convection with small convective 
cells whose highly random appearance can hardly be represented by only 20 ensemble mem- 
bers. We consider daily forecasts of precipitation amounts between 1200 UTC and 1800 
UTC by the COSMO-DE-EPS model run initialized at 0000 UTC. The whole year 2011 will 
be used as a verification period, whence the respective training periods are a bit shorter than 
30 days during the first few days. 



Forecast verification techniques 



As a measure of predictive performance we use proper scoring rules (jJolliffe and Stephensonl . 



20031 : IWilksl . l2006l : iGneiting and Rafteryl . 120071 ) which address both calibration and sharpness 



of the probabilistic forecasts simultaneously. In addition to the CRPS, introduced in Sec. [H 
we consider Brier scores for the binary event o that the observed precipitation amount y 
exceeds a certain threshold t. If p = 1 — F{t) denotes the predicted probability of this event, 
the Brier score is defined as 

BS{p,o)={o-pf ={F{t)-l^y^^){t)f 

Brier scores are useful to check, for example, if probabilistic forecasts for high precipitation 
levels are reliable and issued with reasonable resolution, whereas the CRPS, being an integral 
over the BS at all thresholds, measures the overall performance. Since the frequency of 
observed I's rapidly declines with increasing thresholds, the corresponding values of the BS 
have entirely different magnitudes. To facilitate a comparison, it is convenient to consider 
the Brier skill score 

BS 



ESS = 1 - 



BSref 

which is positively oriented and can be interpreted as the improvement over a reference 
forecast. In our study the reference forecast at each site s will be the fraction '^{f,k>t} 
of ensemble forecasts above the threshold. In the same way, we will employ the continuous 
ranked probability skill score (CRPSS) which is defined analogously to the BSS, and uses 
the CRPS of the raw ensemble forecasts 

K K 

CRPS(F,„.,1/) = -^1/^-2/1--^ \fk-fk'l (5) 

k=l k,k'=l 

where Fens = '^'^k=iMfk,oo){i)y as a reference score. Expression (l5 l) is obtained from the 
alternative representation CRPS(F, y) = Fjp\X—y\ — ^Ep\X—X'\ (see 



Gneiting and Raftery 



20071 ) by plugging in the empirical CDF of the ensemble forecasts. 
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Table 1: Brier and CRP skill scores for the different post-processing methods. 





BSS (0 mm) 


BSS (5 mm) 


BSS (10 mm) 


BSS (15 mm) 


CRPSS 


extended LR 


0.079 


0.044 


0.048 


0.044 


0.052 


BMA 


0.042 


0.008 


0.022 


0.031 


0.024 


EMOS 


0.055 


0.046 


0.057 


0.040 


0.054 



In order to check whether possible differences in skill are significant we use statistical tests 
for equal performance. We will focus on the CRPS (which measures overall perfor mance) 
and p rovide p- values obtained from paired t-tests. Following the recommendations of iHamill 
( 1l999l ). tests are performed using differences in daily sums of CRPS rather than differences 
in daily CRPSSs. The aggregation of scores to daily sums accounts for spatial dependence 
wh ile scores are c onsidered independent from one day to the next, an assumption supported 
by iHamilll ( 119991 ) . Table 3, and the fact that the accumulation periods considered here are 
18h apart. 

To assess the calibration of the probability over threshold forecasts by the raw ensemble 
on t he one hand and our EMOS approach on the other hand, we use reliability diagrams 
Ch. 7) enhanced with uncertainty information. The forecast probabilities 



Wilks. 2006 



le.g. 

are divided in 11 categories defined as follows: 

5i := [0, 0.05), B2 := [0.05, 0.15), . . . , 5n := [0.95, 1.0]. 

A forecast is reliable if the relative frequency of the event o, = 1, when computed over all 
instance s i for which p,; fall s into the interval B^, is equal to the mean vf^ of Pi over that 
interval (IBrocker and Smith! . 120071 ) . Especially for the first bin and high threshold values, Ttk 
can be quite different from the arithmetic centre of Bk, and so we follow iBrocker and Smith 
( I2OO7I ) and plot the observed relative frequencies for a bin Bk versus vf^. To illustrate the 
sharpness of the probabilistic forecasts we add histograms for the frequency of usage of 
the different bins. High-probability forecasts of heavy precipitation amounts were issued 
very infrequently, and so we plot the inset histograms on a log- 10 scale, providing a better 
visualization of the distribution in the tails. 

Reliability diagrams allow one to graphically assess both reliability and resolution of proba- 
bilistic forecasts. A q uantitative mea sure for these characteristics is obtained from the Brier 
score decomposition (IMurphyl . Il973l ). We add this information to our diagrams, using a 
slightly different decomposition proposed by iFerro and Frickerl (120121 ) which is less biased in 
finite samples and therefore provides a more accurate measure of performance. The observed 
relative frequencies displayed in the reliability diagram are subject to substantial sampling 
variability, and so are the estimated REL, RES, and UNC components of the Brier score. 
To extrac t an estimate of th e uncertainty in these statistics we use bootstrap resampling 
similar to iHamill et al.l (120071 ). Accounting again for dependence of forecast errors in space 
but assuming independence in time, 90% confidence intervals for the observed relative fre- 
quencies on the one hand and REL, RES, and UNC on the other hand were estimated from 
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a 100 0-member block bootstrap sample (following lEfron and Tibshiranil . Il993l and iHamilll . 
19991 ) and placed into the reliability diagrams. 



Comparison with state of the art post-processing approaches 



We first compare the performance of our EMOS method from Se c. M with that of ex t ended 
logistic regression (LR) and Bayesian model averaging (BMA). ISchmeits and Kokl (120101 ) 
compared the latter two approaches using ECMWF ensemble precipitation re-forecasts over 
the Netherlands for medium to long forecast ranges, and found that after some modification 
of BMA both methods performed similar. The present setup with short-range forecasts 
from a high-resolution NWP model may, however, present rather different challenges and 
the conclusions concerning the performance of the two methods need not be the same. 

Extended LR was proposed by IWilkd (l2009f ) and fits a censored logistic distribution with a 
mean related to the (power-transformed) ensemble mean, to power transforms of observed 
precipitation amounts. At its core it is an EMOS approach similar to ours, but the model 
is rewritten such that the model fitting can be done within a logistic regression framework. 
Each observation ys is turned into a number of binary responses corresponding to exceedance 
of certain climatological quantiles specific to site s. By including a function g{q) of the respec- 
tive quantile in the set of predictors, the LR fit can be done for all quantiles simultaneously, 
and the fitted n aodel c o rresp onds to a full predictive distribution. We implement this method 
as described in IWilkd (120091 ) with climatological quantiles derived from rain gauge observa- 
tions between 2005 and 2010. Since the quantiles qn m, qn.i , q n..s.s an d go.5 are equal to zero 



for almost all stations, and the other quantiles used in IWilksl (120091 ) are relatively small, we 
use go.98 as an additional threshold. After communication with Zied Ben Bouallegue f rom 
Deutscher Wetterdienst, who uses a very similar forecast setup ( jBen Bouallegud . 120121 ) we 
model square roots of precipitation accumulations, i.e. g{q) = 62\/^, and use the fourth root 
of the ensemble mean as a predictor. The latter yields a considerable improvement in our 
setup compared to the original choice. 

BMA was first d eveloped for ternperatu re (IRaftery et al.l . |2005| ). a variant for precipitation 



was proposed by ISloughter et al.l (120071 ). Each ensemble member is associated with kernel 
density Pk{ys\fsk), which for precipitation consists of a discrete (point mass at zero) and 
a continuous (gamma distribution) component which is fitted to cube root y of observed 
precipitation amounts. The final predictive density is then a mixture 



K 



p{ys\fls,---,fKs) = ^WkPkiVslfks) 



k=l 



of the members' individual densities with weights Wi, . . . , wr- that reflect each m ember's skill 



BMA is implemented in the 'ensembleBMA' package in R (IFraley et al.l . 1201 ll ). an d we use 



it wit h the control options tol=le-4, which stabilizes the estimates of the weights (IHamilll . 
20071 ). and power=0.5, which gives a much better fit for our data set than the default cube 
root transformation. 
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Figure 4: Temporal evolution of the location (left) and scale (middle) parameters, and of the shape 
parameter ^ (right) for EMOS using a 60 km neighbourhood, ai and /3j, i = 1,2,3 are plotted as 
black solid lines {i = 1), red dashed lines {i = 2) and green dotted lines (i = 3), respectively. 

Extended LR and BMA were both fitted with the same training sets as our EMOS method, 
consisting of observations of all available stations with identical parameters and the 30 
training days preceding every verification day. Brier skill scores for different threshold values 
and CRPSSs are shown in Tabled] All three methods improve on the probabilistic forecasts 
derived from the raw ensemble with the performance of extended LR and EMOS being 
approximately equal and BMA a little bit behind. Is the improvement of our method over 
the raw ensemble and the reference methods statistically significant or may it simply be a 
lucky coincidence? To answer this question we tested for equal performance of EMOS and 
its competitors and obtained the p-values: 

ensemble: < 0.01, ext. LR: 0.63, BMA: < 0.01 

These values suggest our method attains significantly better CRPSs than the raw ensemble 
and BMA, while the improvement over extended LR is not significant. A thorough analysis 
of reliability diagrams (not presented here) and the corresponding Brier score decomposition 
shows that BMA tends to underpredict precipitation amounts at medium to high levels. 
Extended LR predictions are more reliable than the ones by EMOS, but the latter have 
better resolution. This is plausible since our method is based on the untransformed ensemble 
mean and therefore stays closer to the raw ensemble forecasts. Adjustments of our predictors 
might well improve reliability and even the overall skill, but we contend that extensions along 
the lines of Sec. E] are a preferable means to improve reliability, and may be able to achieve 
this without trading such improvement for lower resolution. That this is indeed the case will 
be shown in the next subsection. 



The benefit of using neighbourhood information 



The general idea of considering foreca sts from a larger ne ighbourhood around the location of 
interest has been discussed previously. iTheis et al.l (|2005[ ) use deterministic forecasts within a 
spatio-temporal neighbourhood as a substitute f or a forecast ensemble and de rive probabilis- 
tic statements from it. S imilar ideas are used bv|Ben Bouallegue et al.l (120 12[ ) to enhance the 
COSMO-DE ensemble. iBentzien and Friederichd (|2012[ ) consider an enhanced time-lagged 
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ensemble and derive from it different predictors for post-processing. Our approach is in the 
same spirit, but sets itself apart through the distance based weighting scheme, the employ- 
ment of certain ensemble statistics as predictors of the scale of the predictive distribution, 
and in particular through the distinction between ensemble uncertainty and uncertainty due 
to displacement errors. To analyse the role of these scale predictors we depict the temporal 
evolution of all parameters of our EMOS method with a 60 km neighbourhood in Fig. |H The 
bias parameters are very similar to those obtained without neighbourhood information, and 
the shape parameter exhibits the same seasonal cycle with a slightly smaller amplitude (un- 
like in Fig. [2] it is now always positive). More interesting is the role of the two aforementioned 
predictors for the scale parameter, highlighted by the temporal evolution of Pi (ensemble 
uncertainty) and (32 (spatial uncertainty). The latter is constantly higher, suggesting that 
displacement errors are indeed an important factor in prediction uncertainty. Moreover, the 
drop of Pi together with the increase of Pq during the summer months suggests that sources 
of uncertainty behind the processes causing precipitation in summer are still hard to capture, 
while spatial uncertainty is present throughout the year. 

We now study the effect of neighbourhood size on the predictive performance of EMOS 
forecasts, and check whether the climatological correction suggested in Sec. [3] yields indeed 
a notable improvement. Figure [5] depicts the CRPSS and the BSS for different neighbour- 
hood sizes. Apparently, the use of neighbourhood information can improve the predictive 
performance considerably, both with lower and higher thresholds. The biggest improvement 
is obtained for a neighbourhood radius of 60 km for which the CRPSS is about 0.072. The 
climatological correction further increases this value to about 0.076. As can be expected, 
its effect increases with neighbourhood size and it slows the decline of the forecast skill af- 
ter the maximum at 60 km. An analysis of the score differences calculated separately for 
each verification date (not presented here) shows that the correction enhances the benefit 
of using neighbourhood information on the majority of days. However, there are also a 
few days where results would be better without climatological correction, which suggests 
that not every precipitation event is influenced by orographic effects. Is the improvement in 
overall predictive performance statistically significant? Table [2] gives p-values for tests for 
equal performance of EMOS methods using different neighbourhoods. In the first two rows, 
each value corresponds to a comparison of a neighbourhoods with radius r and r — 10 km, 
respectively. The 10 km neighbourhood is compared with the simple EMOS method from 
Sec. [21 For up to 40 km (50 km if climatological correction is used) increasing the radius of 
the neighbourhood significantly increases the performance, while increasing the neighbour- 
hood radius to more than 80 km (90 km if climatological correction is used) results in in 
a significant decline of performance. A direct comparison of the results with and without 
climatological correction shows that the score differences are significant at the 5% level for 
all neighbourhood sizes. Comparing the top and bottom plot in Fig. [5] it may seem surprising 
that even very small score differences are clearly significant. This can be explained by the 
fact that the t-test statistic depends not only on the average score difference, but also on 
the standard deviation of daily differences. The forecasts with and without climatological 
correction are rather similar with the latter being slightly better. This results in very small 
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Figure 5: BSS and CRPSS with the EMOS method and different neighbourhood sizes with 
(bottom) and without (top) chmatological correction. 
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Table 2: P-values obtained by testing for equal performance of the EMOS method using a neigh- 
bourhood with radius r and the same method with neighbourhood radius r — 10 km. The first two 
rows correspond to the variants with and without climatological correction, the last row tests, for 
each radius, if the different between the two variants is significant. 





10 km 


20 km 


30 km 


40 km 


50 km 


60 km 


70 km 


80 km 


90 km 


100 km 


without cl. corr. 


< 0.01 


< 0.01 


< 0.01 


0.019 


0.22 


0.90 


0.41 


0.098 


0.013 


< 0.01 


with cl. corr. 


< 0.01 


< 0.01 


< 0.01 


< 0.01 


0.015 


0.27 


0.98 


0.35 


0.074 


< 0.01 


with vs. without 


< 0.01 


< 0.01 


< 0.01 


< 0.01 


< 0.01 


< 0.01 


< 0.01 


< 0.01 


< 0.01 


< 0.01 



standard deviations of daily score differences and thus renders even small score differences 
significant. Overall we conclude that neighbourhood sizes between 50 km and 80 km yield the 
most skilful proba bilistic forecasts with the m aximum being attained for r = 60 km. This is 
in agreement with iJohnson and Wand (120121) who study nei ghbourhood-based probabilistic 
forecasts using a 48 km radius and iMittermaier et al.l ( 120111 ) who found the high-resolution 
(4 km) Unified Model (MetUM) to become skilful at a scale of about 30 — 35 km (note that 
the effective radius of our approach is smaller than r because gridpoints near the boundaries 
of the neighbourhoods are assigned very low weights). 

We finally take a closer look at the reliability and resolution of our best-performing method 
(the one with neighbourhood radius 60 km and climatological correction) and compare with 
the raw ensemble forecasts and with the EMOS method based on the local forecasts only. 
The confidence intervals in Fig. |6] are rather wide for the high threshold values, indicating 
that both reliability and resolution vary considerably from day to day. Yet, it is apparent that 
both EMOS methods strongly improve forecast reliability, even though deviations from the 
diagonal at the mm and 5 mm level appear to be systematic and might be caused by the lack 
of flexibility that comes with the assumption of a parametric distribution. This assumption, 
however, turns out to benefit probabilistic forecasts at high thresholds: 1075 out of 1136 rain 
gauges report 6-h precipitation accumulations of more than 15 mm on less than 1% of all 
days in 2011. Despite the resulting lack of training cases, our method yields forecasts that are 
reliable and have good resolution. Surprisingly, the use of neighbourhood information does 
not only improve the reliability, but also the resolution of the EMOS predictions. This may 
be somewhat counter-intuitive since the neighbourhood-based statistics smooth the ensemble 
forecast and thus flatten the peaks. An explanation for this might be, that the improvement 
of reliability for many post-processing methods often goes along with pulling the forecasts 
a bit towards climatology. If the reliability is partly improved by moderately smoothing the 
forecasts, the subsequent post-processing correction might be more moderate, so that the 
resolution loss due to smoothing is eventually offset. 
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Figure 6: Reliability diagrams for raw ensemble forecasts (top) and probabilistic forecasts by 
EMOS based on information only (middle) and using a 60 km neighbourhood and climatological 
correction (bottom). From left to right we consider exceedance of 0.1 mm, 5 mm, 10 mm, and 
15 mm precipitation in 6h. The insets show histograms of the log-frequency of cases within the 
respective bins, the bars correspond to 90% confidence intervals obtained by bootstrap resampling, 
and in the bottom right corner 90% confidence intervals for reliability, resolution, and uncertainty 
are given. 
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5 Discussion 



The EMOS method presented in Sections [2] and [3] was shown to be effective for cahbrating 
precipitation forecasts by the COSMO-DE-EPS. It improved rehability while maintaining 
the good resolution of the raw ensemble at threshold values up to 15 mm (6 h)~^. It still 
needs to be tested if the model also performs well with even more extreme thresholds, but 
a much larger verification data set would be required in order to come to authoritative 
conclusions. It might well turn out that further statistics of the ensemble forecasts in the 
neighbourhood (e.g. high quantiles) must be incorporated in the predictive distribution to 
convey information on heavy precipitation events. In the settings considered in this paper, 
even our basic method compared well with extended LR and BMA, two state-of-the-art post- 
processing approaches, and the advanced version using neighbourhood information yielded 
further improvement. We presume that our approach works well also for other ensemble 
prediction systems. 

There are several lines along which our method could still be extended and improved. A 
contentious issue is certainly the fact that we use the same post-processing parameters over 
the whole forecast domain. We do so in order to gather sufficient training data to get stable 
estimates without having to rely on several years of forecast records (which are not available 
in our case). For forecast domains with complex topography or consisting of sub-domains 
with very diverse precipitation climatologies it may be more appropriate, however, to let the 
pa rameters vary spatia lly. This could be realized as an extension of our method analogously 



to iKleiber et al.l ( l201ll ) where post-processing parameters are fitted locally and geostatisti- 
cal methods are used to extrapolate them to locations where observations for model fitting 
are not available. An alternative strategy for locally adaptive post-processing could be the 
inclusion of further predictors such as orographic elevation or orographic slope (or suitable 
transforms of them) with explanatory power for spatial variations in post-processing. 
A further simplification in our approach is that it does not account for pot e ntiall y different 
skill of the ensemble members. This could be done as in iGneiting et al.l (|2005[ ) by using 
the individual ensemble forecasts rather than the ensemble mean as predictors for the lo- 
cation parameter m. We have refrained from doing this here because it would increase the 
number of model parameters from 7 to 26 and entail the danger of overfitting. However, 
while COSMO-DE-EPS mem bers have been constructed such that they all have comparable 



skill (IGebhardt et al.l . l201ll ). overfitting may be a price worth paying if one uses an EPS 



where some members are distinctly less skilled than others. A completely different weighting 
strategy that we plan to investigate in the future could try to make use of the insights by 



Keil and Craigl (120111 ) who found that both spread and skill of the different subgroups of 



members in COSMO-DE-EPS depend on the meteorological situation, i.e. on whether pre- 
cipitation is stratiform, due to equilibrium or due to non-equilibrium convection. One might 
even go one step further and check if the seasonal cycle of our post-processing parameters 
(see Fig. S]) can actually be attributed to seasonal variations in the frequency of weather 
regimes. In this case further improvement of the reliability of probabilistic forecasts might 
be possible if training days are selected by similarity of meteorological situations rather 



20 



than temporal pro ximity. In a slightly different setting this has been done successfully by 



Kober et al.l (|2012[ ) and their ideas might be transferable to our framework. 



Finally, there is the issue of spatial consistency, which becomes important when the interest 
is in forecasting spatially aggregated quantities like the overall precipitation amount in some 
river catchment, rather than precipitation amounts at individual sites. The EMOS method 
presented s o far only yields univ ariate predictive distributions, but approaches similar to 
the one by iBerrocal et al.l (120081 ) could be used to extend it to a multivariate distribution 
model which can b e used to simulate calibrated and spatially consistent precipitation fields. 



Sigrist et al.l (120 12[ ) present a spatio-temporal model that accounts for dependencies between 



both multiple locations and multiple look-ahead times, and can model phenomena such as 
transport and diffusion. A combination of their ideas with the approach presented here may 
be able to attain the ultimate goal of producing calibrated, sharp, and physically realis- 
tic probabilistic forecasts of spatio-temporal precipitation fields, but quite some effort still 
seems to be required to realize this in such a way that the resulting method is suitable for 
operational use. A much simpler approach whic h could be used right away and may also 
yield good results is described in ISchefzikI (120111 ): ensemble copula coupling (ECC) can be 
combined with any univariate post-processing method and transfers the dependence struc- 
ture of the raw ensemble to samples from the univariate predictive distribution. The EMOS 
metho d for precipitat i on pr esented in thi s article complements existing raetho ds for temper- 
ature (IGneiting et al.U2005[ ). wind spe ed (iThorarinsdottir and Gneitind. 120101) . wind vectors 
(ISchuhen et al.l . |2012[ ) and wind gusts (IThorarinsdottir and Johnsonl . l2012l ). It is reasonably 
simple and can serve as a basis for future developments along the lines described above. 
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